Language Guided Visual Perception

Author

  • Mohamed H. Elhoseiny
Abstract

People typically learn through exposure to visual stimuli paired with linguistic descriptions. For instance, teaching visual concepts to children is often accompanied by descriptions in text or speech. This motivates the question of how such a learning process could be modeled computationally. In this dissertation, we explore three settings in which combining language and vision is useful for machine perception of images and videos.

First, we address the question of how to use purely textual descriptions of visual classes, with no training images, to learn explicit visual classifiers for them. We propose and investigate two baseline formulations, based on regression and on domain transfer, that predict a classifier from text. We then propose a new constrained optimization formulation that combines a regression function and a knowledge-transfer function, with additional constraints, to predict the classifier parameters for new classes. We also propose kernelized models, which allow arbitrary kernel functions in the visual and text spaces. We applied the models to predict visual classifiers for two fine-grained categorization datasets, and the results indicate successful prediction against several baselines.

Second, we model searching for events in videos as a language-and-vision problem and propose a zero-shot event detection method based on a multimodal distributional semantic embedding of videos. Our model builds on distributional semantics and extends it in three directions: (a) semantic embedding of multimodal information in videos (with a focus on the visual modalities), (b) automatic determination of the relevance of concepts/attributes to a free-text query, which could be useful for other applications, and (c) retrieval of videos by a free-text event query based on their content. We validated our method on the TRECVID MED (Multimedia Event Detection) challenge.
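As a rough illustration of the zero-shot event detection idea above, the following sketch embeds a free-text event title and concept names in a shared word-vector space, scores each concept's relevance to the query by cosine similarity, and ranks videos by relevance-weighted concept confidences. All of it — the vocabulary, the random stand-in word vectors, the concept list, and the per-video confidences — is an illustrative assumption, not the dissertation's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy word vectors standing in for a learned distributional semantic space.
vocab = ["dog", "show", "grooming", "skateboard", "trick", "park"]
word_vec = {w: rng.normal(size=8) for w in vocab}

def embed(text):
    """Average the word vectors of a free-text phrase (a common simple choice)."""
    vs = [word_vec[w] for w in text.split() if w in word_vec]
    v = np.mean(vs, axis=0)
    return v / np.linalg.norm(v)

# Hypothetical concept detectors, and per-video detection confidences.
concepts = ["dog", "grooming", "skateboard", "park"]
video_scores = {
    "vid1": np.array([0.9, 0.8, 0.0, 0.1]),   # e.g., a dog-grooming video
    "vid2": np.array([0.1, 0.0, 0.9, 0.7]),   # e.g., a skateboarding video
}

query = embed("dog show")                      # event title only, as a free-text query

# (b) relevance of each concept to the query, by cosine similarity.
relevance = np.array([embed(c) @ query for c in concepts])

# (c) rank videos by relevance-weighted concept confidences.
ranking = sorted(video_scores, key=lambda v: -(relevance @ video_scores[v]))
```

The key property this sketches is that no training videos of the event are needed: relevance between the query and the concept vocabulary is computed purely in the semantic embedding space.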
Using only the event title as a query, our method outperformed the state of the art, which relies on long event descriptions.

Third, motivated by the results above, we propose a uniform and scalable setting for learning an unbounded number of visual facts. Our models can learn not only objects but also their actions, attributes, and interactions with other objects, in one unified learning framework and in a never-ending way. The training data comes as structured facts in images, including (1) objects (e.g., ), (2) attributes (e.g., ), (3) actions (e.g., ), and (4) interactions (e.g., ). We worked at the scale of 814,000 images and 202,000 unique visual facts. Our experiments show the advantage of relating facts by structure in the proposed models, compared to four designed baselines, on bidirectional fact retrieval.

Defense Committee: Prof. Ahmed Elgammal (Chair), Prof. Casimir Kulikowski, Prof. Abdeslam Boularias, Prof. Abhinav Gupta (Carnegie Mellon University)
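The first setting described in the abstract — predicting an explicit visual classifier from a purely textual class description — can be sketched as a simple regression baseline. The version below learns a ridge-regression map from per-class text features to trained linear classifier weights, then predicts a hyperplane for an unseen class from its text alone; all dimensions, the random data, and the choice of ridge regression are illustrative assumptions standing in for the dissertation's formulations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative): text features, image features, seen classes.
d_text, d_img, n_seen = 50, 20, 30

T = rng.normal(size=(n_seen, d_text))      # text feature vector per seen class
W = rng.normal(size=(n_seen, d_img))       # trained classifier weights per seen class

# Ridge regression: learn a map M so that t @ M approximates w (text -> weights).
lam = 0.1
M = np.linalg.solve(T.T @ T + lam * np.eye(d_text), T.T @ W)

# Predict a classifier for an unseen class from its text description alone.
t_new = rng.normal(size=d_text)
w_new = t_new @ M                          # predicted hyperplane, shape (d_img,)

# Score an image feature vector with the predicted classifier.
x = rng.normal(size=d_img)
score = float(w_new @ x)
```

The abstract's constrained optimization and kernelized models go beyond this baseline, but the input/output contract is the same: text features in, classifier parameters out, with no training images of the new class.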


Similar articles

A Corpus-Guided Framework for Robotic Visual Perception

We present a framework that produces sentence-level summarizations of videos containing complex human activities that can be implemented as part of the Robot Perception Control Unit (RPCU). This is done via: 1) detection of pertinent objects in the scene: tools and direct-objects, 2) predicting actions guided by a large lexical corpus and 3) generating the most likely sentence description of th...

Moving Stimuli Facilitate Synchronization But Not Temporal Perception

Recent studies have shown that a moving visual stimulus (e.g., a bouncing ball) facilitates synchronization compared to a static stimulus (e.g., a flashing light), and that it can even be as effective as an auditory beep. We asked a group of participants to perform different tasks with four stimulus types: beeps, siren-like sounds, visual flashes (static) and bouncing balls. First, participants...

Effect of Aerobic Training on Verbal Working Memory, Cognitive Flexibility and Visual Perception in Patients with Written Disorder

Introduction: Writing disorder is the most complex language-skill disorder in humans, and patients with it often have problems with executive functions such as verbal working memory, cognitive flexibility, and visual perception. Therefore, the present research aimed to determine the effect of aerobic training on verbal working memory, cognitive flexibility, and visual perception in patients with the wr...

Mental Images and the Brain

One theory of visual mental imagery posits that early visual cortex is also used to support representations during imagery. This claim is important because it bears on the "imagery debate": Early visual cortex supports depictive representations during perception, not descriptive ones. Thus, if such cortex also plays a functional role in imagery, this is strong evidence that imagery does not rel...

Language-guided visual processing affects reasoning: the role of referential and spatial anchoring.

Language is more than a source of information for accessing higher-order conceptual knowledge. Indeed, language may determine how people perceive and interpret visual stimuli. Visual processing in linguistic contexts, for instance, mirrors language processing and happens incrementally, rather than through variously-oriented fixations over a particular scene. The consequences of this atypical vi...

The Study of Perception and Expression of Nouns and Reliability of Two Visual Comprehension and Expression of Nouns Tests in Mild-Moderate Hearing Loss Children

  Background and Objective: Children with hearing loss demonstrate cognitive, communication, speech and language deficits. Poor organization in mental lexicon and reduction in vocabulary are the obvious consequences of hearing loss. The main objective of this study was to evaluate perception and expression of nouns, and test-retest reliability of two picture-pointing and picture-naming tests,...



Publication date: 2016